NSF PAR Search | NSF Public Access Repository

A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?

https://doi.org/10.1016/j.future.2024.07.022

Bautista-Gomez, Leonardo; Benoit, Anne; Di, Sheng; Herault, Thomas; Robert, Yves; Sun, Hongyang (December 2024, Future Generation Computer Systems)

Full Text Available
Concealing Compression-accelerated I/O for HPC Applications through In Situ Task Scheduling

https://doi.org/10.1145/3627703.3629573

Jin, Sian; Di, Sheng; Vivien, Frédéric; Wang, Daoce; Robert, Yves; Tao, Dingwen; Cappello, Franck (April 2024, ACM)

Full Text Available
Online Scheduling of Moldable Task Graphs under Common Speedup Models

https://doi.org/10.1145/3545008.3545049

Benoit, Anne; Perotin, Lucas; Robert, Yves; Sun, Hongyang (August 2022, ACM)

Comparing Distributed Termination Detection Algorithms for Modern HPC Platforms

https://doi.org/10.15803/ijnc.12.1_26

Bosilca, George; Bouteiller, Aurélien; Herault, Thomas; Le Fèvre, Valentin; Robert, Yves; Dongarra, Jack (January 2022, International Journal of Networking and Computing)

Full Text Available
Revisiting Credit Distribution Algorithms for Distributed Termination Detection

https://doi.org/10.1109/IPDPSW52791.2021.00095

Bosilca, George; Bouteiller, Aurelien; Herault, Thomas; Le Fevre, Valentin; Robert, Yves; Dongarra, Jack (June 2021, IEEE)
This paper revisits distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages.
Full Text Available
Distributed-memory multi-GPU block-sparse tensor contraction for electronic structure

https://doi.org/10.1109/IPDPS49936.2021.00062

Herault, Thomas; Robert, Yves; Bosilca, George; Harrison, Robert J.; Lewis, Cannada A.; Valeev, Edward F.; Dongarra, Jack J. (May 2021, 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW))
Many domains of scientific simulation (chemistry, condensed matter physics, data science) increasingly eschew dense tensors for block-sparse tensors, sometimes with additional structure (recursive hierarchy, rank sparsity, etc.). Distributed-memory parallel computation with block-sparse tensorial data is paramount to minimize the time-to-solution (e.g., to study dynamical problems or for real-time analysis) and to accommodate problems of realistic size that are too large to fit into the host/device memory of a single node equipped with accelerators. Unfortunately, computation with such irregular data structures is a poor match to the dominant imperative, bulk-synchronous parallel programming model. In this paper, we focus on the critical element of block-sparse tensor algebra, namely binary tensor contraction, and report on an efficient and scalable implementation using the task-focused PaRSEC runtime. High performance of the block-sparse tensor contraction on the Summit supercomputer is demonstrated for synthetic data as well as for real data involved in electronic structure simulations of unprecedented size.
Full Text Available
A comparison of several fault-tolerance methods for the detection and correction of floating-point errors in matrix-matrix multiplication

Le Fèvre, Valentin; Herault, Thomas; Langou, Julien; Robert, Yves (January 2020, Euro-Par 2020: Parallel Processing Workshop)

Full Text Available
Computing the expected makespan of task graphs in the presence of silent errors

https://doi.org/10.1016/j.parco.2018.03.004

Casanova, Henri; Herrmann, Julien; Robert, Yves (July 2018, Parallel Computing)

Full Text Available
Checkpointing Workflows for Fail-Stop Errors

https://doi.org/10.1109/CLUSTER.2017.14

Han, Li; Canon, Louis-Claude; Casanova, Henri; Robert, Yves; Vivien, Frederic (September 2017, IEEE Cluster)

We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. For general DAGs this problem is hopelessly intractable. In fact, given a solution, computing its expected makespan is still a difficult problem. To address this challenge, we consider a restricted class of graphs, Minimal Series-Parallel Graphs (M-SPGs). It turns out that many real-world workflow applications are naturally structured as M-SPGs. For this class of graphs, we propose a recursive list-scheduling algorithm that exploits the M-SPG structure to assign sub-graphs to individual processors, and uses dynamic programming to decide which tasks in these sub-graphs should be checkpointed. Furthermore, it is possible to efficiently compute the expected makespan for the solution produced by this algorithm, using a first-order approximation of task weights and existing evaluation algorithms for 2-state probabilistic DAGs. We assess the performance of our algorithm for production workflow configurations, comparing it to (i) an approach in which all application data is checkpointed, which corresponds to the standard way in which most production workflows are executed today; and (ii) an approach in which no application data is checkpointed. Our results demonstrate that our algorithm strikes a good compromise between these two approaches, leading to lower checkpointing overhead than the former and to better resilience to failure than the latter.
Full Text Available
A failure detector for HPC platforms

https://doi.org/10.1177/1094342017711505

Bosilca, George; Bouteiller, Aurelien; Guermouche, Amina; Herault, Thomas; Robert, Yves; Sens, Pierre; Dongarra, Jack (August 2017, The International Journal of High Performance Computing Applications)

Full Text Available
